Introduction

About This Project

This document presents a complete Species Distribution Model (SDM) for the American Lobster (Homarus americanus) in the Gulf of Maine region. The workflow follows the mind-map framework provided in the course, progressing through six key stages:

  1. C00 - Setup: Loading tools and understanding spatial data
  2. C01 - Observations: Fetching and filtering occurrence data from OBIS
  3. C02 - Background Sampling: Creating pseudo-absence points for model training
  4. C03 - Covariates: Selecting environmental predictors and checking collinearity
  5. C04 - Models: Training and tuning multiple machine learning algorithms
  6. C05 - Prediction: Generating nowcasts and climate-scenario forecasts

Study Species: American Lobster

The American Lobster is an iconic crustacean of significant ecological and economic importance in the Northwest Atlantic. It supports one of the most valuable fisheries in New England and Atlantic Canada. Understanding how climate change may shift its habitat is critical for fisheries management and conservation planning.


C00: Spatial Data Foundations

Before building my SDM, I need to understand the spatial data structures I’ll be working with. This includes point data (buoy locations, species observations) and raster data (environmental variables from the Brickman model).

Study Region: Gulf of Maine

The plot below shows my study region with NOAA buoy locations that monitor oceanographic conditions:

Gulf of Maine study region showing NOAA monitoring buoys.

Interpretation: The buoys are distributed throughout the Gulf of Maine and provide real-time oceanographic measurements. This region spans from approximately 39°N to 46°N latitude and 64°W to 76°W longitude.

Environmental Data: Brickman Model

I use the Brickman oceanographic model, which provides downscaled climate projections for the Northwest Atlantic. The model includes variables such as sea surface temperature (SST), salinity, and bottom conditions across different climate scenarios.

Monthly sea surface temperature (SST) from Brickman model for present conditions.

Interpretation: SST shows strong seasonal variation in the Gulf of Maine, ranging from cold winter temperatures (< 5°C) to warm summer conditions (> 20°C in coastal areas). This seasonality is crucial for understanding lobster habitat preferences throughout the year.


C01: Occurrence Data from OBIS

Data Acquisition

Species occurrence data is obtained from the Ocean Biodiversity Information System (OBIS), a global open-access repository for marine species observations. The fetch_obis() function downloads records, and read_obis() loads them for analysis.

Starting dataset: I begin with 209,167 American Lobster occurrence records from OBIS.

Data Quality Control

Raw occurrence data often contains issues that must be addressed before modeling:

  • Missing dates: Records without observation dates cannot be assigned to specific months
  • Missing counts: Records without individual counts may be duplicates or errors
  • Temporal range: I keep records from 1970 onwards, a window that brackets the Brickman climatology period (1982-2013) while retaining enough historical data
  • Spatial extent: Only records within the Brickman model domain are retained
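
The four checks above can be sketched in base R on a toy table. This is only an illustration, not the course's read_obis() implementation: the column names follow Darwin Core conventions and are an assumption, and a simple bounding box (using the extent quoted in C00) stands in for the Brickman ocean mask.

```r
# Toy occurrence table; column names are assumed (Darwin Core style).
obs <- data.frame(
  eventDate        = as.Date(c("1965-06-01", "1995-07-15", NA, "2005-02-10", "2010-08-20")),
  individualCount  = c(3, 1, 2, NA, 5),
  decimalLongitude = c(-69.5, -68.2, -70.1, -67.0, -50.0),
  decimalLatitude  = c(43.1, 44.0, 42.5, 44.5, 43.0)
)

# Hypothetical bounding box standing in for the model domain.
bbox <- c(xmin = -76, xmax = -64, ymin = 39, ymax = 46)

keep <- !is.na(obs$eventDate) &                      # has a date
  !is.na(obs$individualCount) &                      # has a count
  obs$eventDate >= as.Date("1970-01-01") &           # 1970 onwards
  obs$decimalLongitude >= bbox[["xmin"]] &           # inside the box
  obs$decimalLongitude <= bbox[["xmax"]] &
  obs$decimalLatitude  >= bbox[["ymin"]] &
  obs$decimalLatitude  <= bbox[["ymax"]]

clean <- obs[keep, ]

# Month of observation, needed later to match monthly covariates.
clean$month <- as.integer(format(clean$eventDate, "%m"))
clean
```

Only the 1995 record survives here: the others fail the year, date, count, or extent check respectively.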

Temporal Distribution

Distribution of American Lobster observations by year.

Interpretation: The histogram shows observation effort has increased dramatically since the 1990s. The dashed red lines mark the Brickman climatology period. I retain records from 1970 onwards to capture sufficient historical data while maintaining relevance to my environmental predictors.

Monthly Distribution

Distribution of American Lobster observations by month.

Interpretation: Observation effort varies by month, with more records during warmer months when field surveys are more feasible. This sampling bias will be accounted for in my background point generation.

Spatial Filtering

I filter observations to include only those within the Brickman model domain (ocean areas with valid environmental data):

American Lobster observations overlaid on the Brickman ocean mask.

Interpretation: Points shown are American Lobster occurrences. Any observations falling on land or outside the model domain are removed. The Brickman mask ensures I only model areas where environmental predictions are available.

Data Summary

Metric                   Count
Starting observations    209,167
Final observations       104,630
Records removed          104,537

C02: Background Point Sampling

Why Background Points?

Species distribution models require both presence data (where the species was observed) and pseudo-absence or background data (locations representing available habitat). Background points characterize the environmental conditions across the study area, allowing models to distinguish habitat preferences.

Using Filtered Observations

Raw Observations by Month

Spatial distribution of American Lobster observations by month.

Interpretation: American Lobster observations are concentrated in coastal areas, particularly around Massachusetts, Maine, and the Bay of Fundy. Observation counts vary by month, reflecting both species behavior and sampling effort. Many grid cells contain multiple overlapping observations.

Spatial Thinning

To reduce spatial autocorrelation, I thin observations so that only one record per Brickman grid cell is retained per month. This prevents overweighting of heavily sampled areas.

Spatially thinned observations (one per grid cell per month).

Interpretation: After thinning, the number of observations drops from 104,630 to 5,728 (compare with the raw counts above). The spatial pattern is preserved, but each grid cell contributes only once per month, reducing pseudoreplication.

Dataset                  Total Records
Original observations    104,630
After spatial thinning   5,728
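
The thinning rule (one record per grid cell per month) can be sketched in base R. This is an illustration on made-up points, not the course's thin_by_cell() function; the 0.25-degree cell size is a hypothetical stand-in for the Brickman grid resolution.

```r
# Toy observations: lon/lat in degrees plus month (1-12).
obs <- data.frame(
  lon   = c(-69.45, -69.40, -69.42, -68.10, -68.12),
  lat   = c( 43.11,  43.13,  43.12,  44.30,  44.31),
  month = c(6, 6, 7, 6, 6)
)

cell_size <- 0.25  # assumed grid resolution in degrees (hypothetical)

# Integer cell indices: points sharing both indices fall in the same cell.
obs$cell_x <- floor(obs$lon / cell_size)
obs$cell_y <- floor(obs$lat / cell_size)

# Keep the first record for each (cell, month) combination.
is_repeat <- duplicated(obs[, c("cell_x", "cell_y", "month")])
thinned <- obs[!is_repeat, ]

nrow(obs)      # 5 records before thinning
nrow(thinned)  # 3 records after thinning
```

The first three points share a cell, but the third falls in a different month, so it is kept as a separate record.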

Sampling Bias Map

I create a bias map based on observation density. Areas with more observations are weighted higher when sampling background points, which accounts for non-random sampling effort.

Sampling bias map based on observation density.

Interpretation: The bias map highlights coastal areas (especially around Massachusetts and Maine) where observation effort is highest. By using bias-weighted background sampling, I ensure that model training accounts for this uneven sampling.

Background Point Generation

I sample background points using bias-weighted sampling. The number of background points per month matches the average number of filtered observations per month (104,630 / 12 ≈ 8,719):

Background points per month: 8,719
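
Bias-weighted sampling can be sketched in base R as a weighted draw over grid cells. The cell counts below are made up, and this is a simplified stand-in rather than the course's sample_background() function.

```r
set.seed(42)  # reproducible draw

# Toy grid of candidate cells with observation counts per cell.
cells <- data.frame(
  cell_id = 1:6,
  n_obs   = c(50, 30, 15, 5, 0, 0)
)

# Bias weight: proportional to observation density.
cells$weight <- cells$n_obs / sum(cells$n_obs)

# Draw background cells with probability proportional to the weight.
n_background <- 1000
bg <- sample(cells$cell_id, size = n_background, replace = TRUE,
             prob = cells$weight)

# Share of background points per cell, to compare with the weights.
bg_share <- table(factor(bg, levels = cells$cell_id)) / n_background
round(cbind(weight = cells$weight, sampled = as.numeric(bg_share)), 3)
```

In this simplified version, cells with no observations are never sampled. A smoothed density surface would give such cells a small non-zero weight.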

Presence points (thinned observations) and background points by month.

Interpretation: Red points are presence locations (thinned observations) and blue points are background (pseudo-absence) locations. Background points are distributed across the study area following the bias map weighting, ensuring representation of available habitat conditions.


C03: Environmental Covariates

Available Variables

The Brickman model provides multiple environmental predictors. However, highly correlated predictors carry overlapping information, which can destabilize model fits and make variable importance hard to interpret. I assess pairwise correlations to select an appropriate subset.

Collinearity Assessment

pairs(present)
Pairs plot showing correlations between Brickman environmental variables.

Interpretation: The pairs plot reveals strong correlations between some variables (e.g., SST and Tbtm). To avoid multicollinearity, I use automated filtering with a correlation threshold of 0.65.
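
One simple way to apply a correlation threshold is a greedy pass that keeps a variable only if it is not too correlated with anything already kept. The sketch below uses made-up values and a hypothetical helper, filter_by_correlation(). It is not necessarily the rule used by the course's filter_collinear() function.

```r
# Toy covariates; names borrowed from the text, values are made up.
a <- 1:10
noise <- rep(c(1, -1), 5)
covars <- data.frame(
  SST  = a,
  Tbtm = 2 * a + noise,  # nearly a linear function of SST
  U    = noise           # only weakly related to SST
)

# Hypothetical helper: variables are visited in column order, so columns
# listed first have priority when a correlated pair is found.
filter_by_correlation <- function(df, threshold = 0.65) {
  r <- abs(cor(df))
  kept <- character(0)
  for (v in names(df)) {
    # Keep v only if it is not too correlated with anything already kept.
    if (all(r[v, kept] <= threshold)) kept <- c(kept, v)
  }
  kept
}

kept <- filter_by_correlation(covars)
removed <- setdiff(names(covars), kept)
kept     # "SST" "U"
removed  # "Tbtm"
```

Because earlier columns have priority, placing must-keep predictors first in the data frame is one way to make sure they survive the filter.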

Variable Selection

Variables selected for modeling:

Retained:             depth, month, SSS, U, Sbtm, V, Tbtm, MLD, SST
Removed (collinear):  Xbtm

I always include depth and month as ecologically important predictors for marine species.

Extract Covariates for Training Data

Presence vs. Background Comparison

This plot compares the environmental conditions at presence locations versus background locations:

Comparison of environmental conditions between presence and background points.

Interpretation: The density plots show how environmental conditions differ between presence (lobster locations) and background (available habitat). Variables where the two distributions differ substantially are likely important predictors. For example, I might observe that lobsters prefer specific depth ranges or temperature conditions.

Save Configuration

Configuration saved with 9 predictor variables for model training.


C04: Model Training

Modeling Approach

Following the course workflow, I train four different machine learning algorithms and compare their performance:

  1. GLM (Generalized Linear Model) - Simple, interpretable baseline
  2. Random Forest - Ensemble of decision trees
  3. Boosted Trees (XGBoost) - Gradient boosting for high accuracy
  4. MaxEnt - Popular algorithm specifically designed for SDMs

Data Preparation

I apply log-transformation to skewed variables (depth, Xbtm) and convert month to numeric for modeling.
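
A minimal base R sketch of both preparation steps, on made-up values. It assumes depth is stored as positive metres and that month arrives as a factor; neither detail is stated above.

```r
# Toy depths in metres (assumed positive); strongly right-skewed.
depth <- c(5, 20, 80, 250, 1200)
log_depth <- log10(depth)  # compresses the long right tail

# Month stored as a factor (assumed): convert via character to recover
# the month number, because as.numeric() on a factor returns the
# internal level codes rather than the labels.
month <- factor(c("3", "7", "12"), levels = c("12", "3", "7"))
month_codes <- as.numeric(month)                 # level codes: 2 3 1
month_num   <- as.numeric(as.character(month))   # month numbers: 3 7 12
```

The two conversions give different answers whenever the factor levels are not in numeric order, which is why the character step matters.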

Spatial Train/Test Split

To evaluate model performance on independent data, I create a spatial block split. This ensures training and testing data are geographically separated, providing a more realistic assessment of model transferability.

Spatial block split showing training (blue) and testing (red) data.

Interpretation: The spatial blocking ensures that nearby points are either all in training or all in testing, preventing spatial autocorrelation from inflating accuracy estimates.
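
The idea of assigning whole blocks, rather than individual points, to the test set can be sketched in base R. The coordinates, the 1-degree block size, and the every-fourth-block rule are all made up for illustration and are not the split used in this project.

```r
# Toy points with coordinates in degrees.
pts <- data.frame(
  lon = c(-70.4, -70.1, -69.2, -68.7, -67.3, -66.9, -66.2, -65.5),
  lat = c( 42.2,  42.4,  43.6,  43.9,  44.3,  44.1,  43.2,  43.4)
)

block_size <- 1  # block width in degrees (assumed)

# Label each point with the block that contains it.
pts$block <- paste(floor(pts$lon / block_size),
                   floor(pts$lat / block_size), sep = "_")

# Assign whole blocks, not individual points, to the test set.
blocks <- unique(pts$block)
test_blocks <- blocks[seq(1, length(blocks), by = 4)]  # every 4th block
pts$split <- ifelse(pts$block %in% test_blocks, "test", "train")

table(pts$split)
```

The first two points fall in the same block, so they always land on the same side of the split.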

Cross-Validation Setup

Within the training data, I use 5-fold spatial cross-validation for hyperparameter tuning:

Five-fold spatial cross-validation structure.

Interpretation: Each color represents a different fold. During tuning, each fold takes a turn as the validation set while the others serve as training data.
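
The fold rotation can be sketched in base R. The block labels are hypothetical; the point is only that folds are assigned per block, and that each fold is held out once.

```r
# Toy block labels for ten training points (hypothetical).
block <- c("A", "A", "B", "C", "C", "D", "E", "F", "F", "G")

# Map each unique block to one of 5 folds, cycling through 1..5.
ublocks <- unique(block)
fold_of_block <- setNames((seq_along(ublocks) - 1) %% 5 + 1, ublocks)
fold <- unname(fold_of_block[block])

# Each fold takes one turn as the validation set.
n_val <- integer(5)
for (k in 1:5) {
  val_idx   <- which(fold == k)   # held out for validation
  train_idx <- which(fold != k)   # used to fit the model
  n_val[k]  <- length(val_idx)
  cat(sprintf("fold %d: %d validation, %d training\n",
              k, length(val_idx), length(train_idx)))
}
```

Every point is used for validation exactly once, so the validation counts across the five folds sum to the number of training points.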

Model Specification

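library(tidymodels)  # recipe(), workflow_set(), and the parsnip model specs
library(tidysdm)     # assumed source of maxent() and its "maxnet" engine

# tr_data is the training split created above; a single row is enough
# for the recipe to learn the variable names, types, and roles.
one_row_of_training_data = dplyr::slice(tr_data, 1)
rec = recipe(one_row_of_training_data, formula = class ~ .)

# One shared recipe crossed with four model specifications.
# Arguments set to tune() are filled in during hyperparameter tuning.
wflow = workflow_set(
  preproc = list(default = rec),
  models = list(
    glm = logistic_reg(mode = "classification") |> set_engine("glm"),
    rf = rand_forest(mtry = tune(), trees = tune(), mode = "classification") |>
      set_engine("ranger", importance = "impurity"),
    btree = boost_tree(mtry = tune(), trees = tune(), tree_depth = tune(),
                       learn_rate = tune(), loss_reduction = tune(),
                       stop_iter = tune(), mode = "classification") |>
      set_engine("xgboost"),
    maxent = maxent(feature_classes = tune(), regularization_multiplier = tune(),
                    mode = "classification") |> set_engine("maxnet")
  )
)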

Hyperparameter Tuning

Hyperparameter tuning results for each model.

Interpretation: This plot shows how different hyperparameter combinations affect model accuracy during cross-validation. Higher values indicate better performance.

Select Best Models

Model Performance Comparison

Performance metrics for each model on test data:

wflow_id         accuracy    boyce_cont  roc_auc     tss_max
default_glm      0.8687392   0.4265954   0.6231085   0.2011549
default_rf       0.8634139   0.2951411   0.8119974   0.5036906
default_btree    0.8687392   0.5783140   0.7215759   0.3574409
default_maxent   0.5889465   0.9755823   0.7539606   0.3904795

Confusion matrices showing classification performance for each model.

Interpretation: The confusion matrices show true positives, true negatives, false positives, and false negatives for each model. Better models have higher values on the diagonal (correct predictions) and lower values off-diagonal (errors).
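
In the metrics table, the GLM ties for the highest accuracy (0.869) yet has the lowest tss_max (0.201). The sketch below uses made-up confusion-matrix counts, not this project's results, to show how that can happen when background points far outnumber presences. It computes the true skill statistic (TSS) at a single threshold; tss_max in the table is presumably the maximum over thresholds.

```r
# Toy confusion-matrix counts (made up, not the project's results):
# a classifier that labels almost everything "background".
tp <- 10   # presences predicted as presence
fn <- 90   # presences predicted as background
tn <- 880  # background predicted as background
fp <- 20   # background predicted as presence

accuracy    <- (tp + tn) / (tp + fn + tn + fp)
sensitivity <- tp / (tp + fn)   # true positive rate
specificity <- tn / (tn + fp)   # true negative rate
tss         <- sensitivity + specificity - 1  # true skill statistic

round(c(accuracy = accuracy, sensitivity = sensitivity,
        specificity = specificity, tss = tss), 3)
```

Accuracy is 0.89 here even though only 10% of presences are detected. TSS weights the two classes equally, so it stays close to zero.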


C05: Predictions

Generating Habitat Suitability Maps

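With trained models, I can now predict habitat suitability across the study area under current and future climate conditions. I use the Boosted Tree model for predictions. On the test data it tied with the GLM for the highest accuracy (0.869) and had the second-highest Boyce index (0.578), although Random Forest scored higher on ROC AUC (0.812 vs. 0.722) and maximum TSS (0.504 vs. 0.357).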

Nowcast: Present Conditions

Habitat suitability prediction for American Lobster under present conditions.

Interpretation: This map shows the probability of American Lobster occurrence under current environmental conditions. Warmer colors (yellow/orange) indicate higher habitat suitability. The species shows strong preference for coastal shelf areas, with seasonal variation visible across months.

Climate Scenario Forecasts

I generate predictions under two Representative Concentration Pathways (RCPs):

  • RCP 4.5: Moderate emissions scenario
  • RCP 8.5: High emissions (“business as usual”) scenario

Each scenario is projected for years 2055 and 2075.

RCP 4.5 Projections

Habitat suitability under RCP 4.5 climate scenario, year 2055.

Habitat suitability under RCP 4.5 climate scenario, year 2075.

RCP 8.5 Projections

Habitat suitability under RCP 8.5 climate scenario, year 2055.

Habitat suitability under RCP 8.5 climate scenario, year 2075.

Comparison Across Scenarios

Comparison of habitat suitability predictions across all climate scenarios.

Interpretation: Comparing across scenarios reveals how climate change may shift American Lobster habitat:

  • Under moderate warming (RCP 4.5), suitable habitat may gradually shift northward
  • Under high warming (RCP 8.5), more dramatic range contractions in southern areas are possible
  • The 2075 projections show greater changes than 2055, as cumulative warming effects intensify

These predictions can inform fisheries management and conservation planning for this economically important species.
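
One way to quantify such shifts is a cell-by-cell difference between a future and a present suitability grid. The sketch below uses small made-up matrices rather than the actual prediction rasters, and assumes rows run from south to north.

```r
# Toy suitability grids (rows = latitude bands from south to north,
# columns = longitude); values are made-up probabilities.
present <- matrix(c(0.8, 0.7,
                    0.6, 0.5,
                    0.3, 0.2), nrow = 3, byrow = TRUE)
future  <- matrix(c(0.5, 0.4,
                    0.6, 0.6,
                    0.5, 0.4), nrow = 3, byrow = TRUE)

# Cell-by-cell change: positive = gain, negative = loss in suitability.
change <- future - present

# Mean change per latitude band summarizes a north-south shift.
band_change <- rowMeans(change)
names(band_change) <- c("south", "middle", "north")
round(band_change, 2)
```

In this toy case suitability falls in the southern band and rises in the northern band, which is the pattern a northward shift would produce.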

Save Predictions

All predictions saved to disk for future analysis.


Summary & Conclusions

Workflow Summary

This project followed the complete Species Distribution Modeling workflow as outlined in the course mind-map:

Chapter  Stage         Key Functions                                                Output
C00      Setup         source("setup.R")                                            Loaded packages and spatial data
C01      Observations  fetch_obis(), read_obis()                                    Filtered occurrence dataset
C02      Background    thin_by_cell(), sample_background()                          Presence + background points
C03      Covariates    filter_collinear(), extract_brickman()                       Environmental predictors
C04      Models        workflow_set(), workflow_map(), workflowset_selectomatic()   Trained model fits
C05      Prediction    predict_stars()                                              Habitat suitability maps

Key Findings

  1. Data Quality: Starting with 209,167 records, quality filtering retained 104,630 observations for modeling.

  2. Environmental Predictors: After collinearity filtering, 9 variables were retained: depth, month, SSS, U, Sbtm, V, Tbtm, MLD, SST.

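  3. Model Performance: Four algorithms were tuned with 5-fold spatial cross-validation and evaluated on a spatially separated test set, which guards against inflated accuracy estimates. Random Forest had the highest ROC AUC (0.812) and MaxEnt the highest Boyce index (0.976).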

  4. Climate Projections: Habitat suitability predictions suggest potential range shifts under future climate scenarios, with more severe changes under RCP 8.5.

Implications

The American Lobster is a keystone species for the Gulf of Maine ecosystem and supports a multi-billion dollar fishery. Understanding how climate change may affect its distribution is critical for:

  • Fisheries management: Adjusting harvest areas and quotas
  • Conservation planning: Protecting climate refugia
  • Ecosystem management: Anticipating cascading effects on predators and prey

This analysis was conducted for JP297Dj: Ocean Forecasting - AI, Ecology, and Data Justice

Colby College, January 2026